Introduction to Video Conferencing

Introduction

The Internet is designed to allow communication between clients and servers. Typically, the client initiates a request, and the server responds to it. The flow is similar for real-time applications where users communicate with each other. However, because each user's request is relayed through an intermediate server, some additional delay is inevitable. For video conferencing, this flow may not be optimal, especially when the server is far away from the user: the large volume of data makes the added delay more noticeable and can result in glitchy playback.

The client-server model for video conferencing

We know that user-perceived latency mainly depends on the transfer time and processing time. Usually, live streams don’t require a lot of processing, but they are forwarded from servers that can be miles away from users. Let’s discuss some available ways to solve this problem of extra miles.

Real-time communication

Real-time communication requires the shortest path to transmit data, which is possible through peer-to-peer communication. Still, it becomes problematic when participants are behind different Network Address Translation (NAT) devices. To overcome this problem, a signaling server is used that allows the communicating parties to share their multimedia session descriptions, which include the IP addresses, ports, and other information essential for the communication. Let's see how signaling helps in sharing this multimedia session information.

Note: A multimedia session is used to identify media-related metadata essential for media transmission, processing, etc. It is also helpful for identifying and enabling device-specific features compatible with other participants.

Point to Ponder

Question

Why is a peer-to-peer connection not always feasible for direct communication between clients on different networks?

Peer-to-peer technology may work within the same local area network (LAN). But for users (specifically those using IPv4) who are behind a NAT or some other kind of edge device that monitors and controls the traffic flow, peer-to-peer connections can be more challenging to establish and maintain. Network Address Translation (NAT) replaces the private address of the communicating device with the public IP address of the LAN gateway, so a peer outside the network cannot address the device directly. In addition, private networks are usually protected by firewalls that block incoming requests for better security, which further restricts communication to the client-server model.

Note: There are some workarounds, such as the Skype protocol (not publicly available), WebRTC, and so on, which allow peer-to-peer communication over the Internet.
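To make the NAT obstacle concrete, here is a minimal sketch of a toy translator. The `ToyNat` class, the addresses, and the starting port are all hypothetical, and real NAT devices apply more rules (for example, filtering by remote endpoint) than this sketch does.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class ToyNat:
    """Hypothetical, heavily simplified NAT that only tracks outbound mappings."""

    def __init__(self, public_ip: str) -> None:
        self.public_ip = public_ip
        self._next_port = 40000  # arbitrary starting port chosen for this sketch
        self._outbound: Dict[Endpoint, Endpoint] = {}  # private -> public
        self._inbound: Dict[Endpoint, Endpoint] = {}   # public -> private

    def send_out(self, private: Endpoint) -> Endpoint:
        """An outbound packet creates (or reuses) a public mapping."""
        if private not in self._outbound:
            public = (self.public_ip, self._next_port)
            self._next_port += 1
            self._outbound[private] = public
            self._inbound[public] = private
        return self._outbound[private]

    def receive_in(self, public: Endpoint) -> Optional[Endpoint]:
        """An inbound packet is delivered only if a mapping already exists."""
        return self._inbound.get(public)


nat = ToyNat("203.0.113.7")

# An unsolicited inbound packet has no mapping, so it is dropped (None).
dropped = nat.receive_in(("203.0.113.7", 40000))

# Once the private host has sent a packet, traffic to the mapped port gets through.
mapped = nat.send_out(("192.168.1.20", 5004))
delivered = nat.receive_in(mapped)
```

The point of the sketch is that a peer outside the network cannot start the conversation: the inside host has to send first, which is why the participants need some outside help to find each other.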

Signaling and connecting

Signaling refers to the initiation of a multimedia session between participants who want to join an audio/video conference. Before the communication starts, clients must exchange and agree on multimedia session information, such as communication addresses (IP and port), media descriptions (text, audio, video, etc.), and other metadata. This information is usually described using the Session Description Protocol (SDP).

The Session Description Protocol (SDP) is a format for describing session information in a standardized form. Because it is only a description format, it must be delivered using protocols like the Session Initiation Protocol (SIP) or the Session Announcement Protocol (SAP), which are specially designed to share session information between participants. Let's take SIP as an example, due to its versatility, and go over how SDP descriptions are exchanged between different participants.
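As an illustration, the sketch below parses a small hand-written session description and pulls out the connection address and the media ports. The sample values (user name, addresses, ports, codec) are made up for this example, and `parse_sdp` is a hypothetical helper that handles only the `c=` and `m=` lines, not the full format.

```python
from typing import Dict, List, Optional

# A hand-written sample description; all values are illustrative assumptions.
SAMPLE_SDP = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=Video call
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 96
a=rtpmap:96 VP8/90000
"""


def parse_sdp(text: str) -> Dict[str, object]:
    """Extract the connection address and media lines from an SDP body."""
    connection_ip: Optional[str] = None
    media: List[Dict[str, object]] = []
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        if key == "c":
            # c=<nettype> <addrtype> <connection-address>
            connection_ip = value.split()[2]
        elif key == "m":
            # m=<media> <port> <proto> <fmt> ...
            kind, port, proto, *formats = value.split()
            media.append(
                {"kind": kind, "port": int(port), "proto": proto, "formats": formats}
            )
    return {"connection_ip": connection_ip, "media": media}


session = parse_sdp(SAMPLE_SDP)
```

Each `m=` line tells the other participant which kind of media to expect and on which port, which is the information the peers must agree on before any frames are sent.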

Session initiation protocol (SIP)

Session Initiation Protocol (SIP) is a signaling protocol that peers use to initiate, maintain, and terminate audio/video conferences. It is not a complete vertical stack, and it usually works with other protocols, such as RTP, RTSP, and HTTP, to provide a comprehensive service. It uses a network of proxy servers, which help discover participants. Participants can negotiate their sessions using the SDP to establish a connection. It is also helpful for adding participants to and removing them from an existing session, for example, a multicast conference session.

The following are the commonly used SIP methods for sending requests to the SIP server when initiating a session:

  • REGISTER: This method registers the contact information of users and creates a map of the public URI to the contact information.
  • INVITE: This method initiates a session that eventually reaches the registered user, who can accept or reject the invitation.
  • ACK: This method confirms that the client has received the final response to its INVITE request. SIP responses carry status codes similar to the status codes in HTTP responses.
  • CANCEL: This method cancels a pending request, and the server generates an error response for that request.
  • BYE: This method terminates the current session.
  • OPTIONS: This method is used for querying information from the SIP servers.
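The methods above can be sketched as a toy, in-memory flow. `ToySipServer` is a hypothetical stand-in that exchanges no real SIP messages; it only mimics the order of REGISTER, INVITE, ACK, and BYE, and it assumes that a registered callee always accepts. The status codes used (200, 404, 481) are standard SIP response codes.

```python
from typing import Dict, Set, Tuple

Dialog = Tuple[str, str]  # (caller URI, callee URI)


class ToySipServer:
    """Hypothetical in-memory stand-in for a SIP registrar and proxy."""

    def __init__(self) -> None:
        self.contacts: Dict[str, str] = {}  # public URI -> contact address
        self.pending: Set[Dialog] = set()   # accepted INVITEs awaiting ACK
        self.sessions: Set[Dialog] = set()  # established sessions

    def register(self, uri: str, contact: str) -> int:
        """REGISTER: map a public URI to the user's contact information."""
        self.contacts[uri] = contact
        return 200  # OK

    def invite(self, caller: str, callee: str) -> int:
        """INVITE: start a session if the callee is registered."""
        if callee not in self.contacts:
            return 404  # Not Found
        self.pending.add((caller, callee))
        return 200  # assumption: the callee always accepts

    def ack(self, caller: str, callee: str) -> bool:
        """ACK: the caller confirms it received the final INVITE response."""
        if (caller, callee) in self.pending:
            self.pending.remove((caller, callee))
            self.sessions.add((caller, callee))
            return True
        return False

    def bye(self, caller: str, callee: str) -> int:
        """BYE: terminate an established session."""
        if (caller, callee) in self.sessions:
            self.sessions.remove((caller, callee))
            return 200  # OK
        return 481  # Call/Transaction Does Not Exist


alice, bob, carol = "sip:alice@example.com", "sip:bob@example.com", "sip:carol@example.com"
server = ToySipServer()
server.register(bob, "192.0.2.20:5060")

unknown_status = server.invite(alice, carol)  # carol never registered
invite_status = server.invite(alice, bob)
acked = server.ack(alice, bob)
bye_status = server.bye(alice, bob)
second_bye_status = server.bye(alice, bob)    # session is already gone
```

In a real deployment, the body of the INVITE and of its response would carry the SDP descriptions that the two participants negotiate.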

The following illustration shows two clients exchanging their session information using signaling (in this case, SIP). Both clients agree to share data (steps 1 and 2), but to exchange actual data (video frames in our case), we need to establish an interactive connection, as described in step 3 in the illustration below:

Session initiation via SIP dialog

Now let’s discuss the different protocols used for exchanging data between the communicating parties.

Data exchange protocols

Several protocols provide bidirectional data flow directly between two devices (endpoints), such as WebSockets, WebRTC, and H323. Among these, WebRTC is one of the most popular for peer-to-peer video conferencing. It's a protocol stack that works on top of other protocols (SRTP, SCTP, DTLS, etc.). It was originally developed to enable multimedia conferencing directly from the browser. Later, some tools and technologies were added to its stack for native application support on different platforms.

Peer-to-peer communication protocols (like WebRTC) introduce powerful capabilities for applications requiring large data transfers in real time, but everything has its cost. Let's discuss the limitations in the next section.

Scaling real-time communication

While we have established that SIP facilitates the exchange of session information and WebRTC handles the exchange of data, these are still not enough to achieve a scalable solution for audio/video conferencing. Peer-to-peer connections are great for small groups, say five to ten participants, but when the number increases, a mesh of peer-to-peer connections is created, which is resource intensive for each participant, as shown in the illustration below:

Multicast through peer-to-peer mesh creation

Point to Ponder

Question

Why is the total number of streams in a mesh topology $n(n-1)$ and not $2n(n-1)$?

Each of the $n$ clients sends $(n-1)$ streams and receives $(n-1)$ streams, so it handles $2(n-1)$ streams. We must take care when summing these per-client counts, because every stream is then counted twice: once at its sender and once at its receiver. For example, the stream from Client 1 to Client 2 in the figure above is counted for both Client 1 and Client 2, which gives $n \times 2(n-1)$ for the whole network. To avoid this duplication, we divide the result by two to get the actual number:

$$\frac{n \times 2(n-1)}{2} = n(n-1)$$
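A quick way to check this count is to enumerate every (sender, receiver) pair directly. The sketch below does that for a few group sizes; `mesh_streams` is a hypothetical helper written for this check.

```python
from itertools import permutations


def mesh_streams(n: int) -> int:
    """Count one-way streams in a full mesh by listing every (sender, receiver) pair."""
    clients = range(n)
    return sum(1 for _sender, _receiver in permutations(clients, 2))


# The enumeration agrees with the closed form n(n-1) for each group size.
for n in (2, 5, 10):
    assert mesh_streams(n) == n * (n - 1)

streams_for_ten = mesh_streams(10)
```

Because each ordered pair is listed exactly once, no stream is double counted, and the growth is quadratic: doubling the group size roughly quadruples the number of streams.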

As shown above, the peer-to-peer paradigm is not suitable for scaling systems. Therefore, for larger groups, there is another approach called Multipoint Control Unit (MCU), which receives incoming streams from each client, merges them into one stream according to defined settings, and sends one stream back to each participant. The working of an MCU server is given in the illustration below:

Multicast through multipoint control unit server

The approach using the MCU server is not optimal because we need to process the incoming streams before compiling them into one outgoing stream, which can cause considerable lag during a multicast stream with many participants.

Let's now discuss another approach called Selective Forwarding Unit (SFU), which receives incoming streams from each client and selectively forwards them to other participants. For example, if there are $n$ participants in a video conference, each client maintains $n$ concurrent streams with the SFU server: one stream for uploading its own data and $(n-1)$ streams for downloading data from the rest of the participants. This approach works well in cases where each client has a decent amount of bandwidth. The working of the SFU server is shown in the figure below:

Multicast through selective forwarding unit server

Note: Simulcast SFU (SSFU) is an extension of SFU where each participant also sends their streams in different resolutions that they can support. The simulcast SFU then adaptively forwards the stream to each client based on available bandwidth and the maximum resolution they can handle.
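The forwarding decision of a simulcast SFU can be sketched as picking, for each receiver, the best layer that fits both its bandwidth and its maximum resolution. The layer names, resolutions, and bitrates below are hypothetical values chosen for illustration, and `pick_layer` is an assumed helper, not part of any real SFU.

```python
from typing import List, NamedTuple, Optional


class Layer(NamedTuple):
    name: str
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # approximate bitrate needed to receive this layer


# Hypothetical simulcast layers uploaded by one sender.
LAYERS: List[Layer] = [
    Layer("low", 180, 150),
    Layer("medium", 360, 500),
    Layer("high", 720, 1500),
]


def pick_layer(
    layers: List[Layer], bandwidth_kbps: int, max_height: int
) -> Optional[Layer]:
    """Return the highest-resolution layer the receiver can afford and display."""
    candidates = [
        layer
        for layer in layers
        if layer.bitrate_kbps <= bandwidth_kbps and layer.height <= max_height
    ]
    # None means that even the lowest layer does not fit.
    return max(candidates, key=lambda layer: layer.height, default=None)


fast_receiver = pick_layer(LAYERS, bandwidth_kbps=2000, max_height=1080)
slow_receiver = pick_layer(LAYERS, bandwidth_kbps=600, max_height=1080)
small_screen = pick_layer(LAYERS, bandwidth_kbps=2000, max_height=360)
```

The server does no transcoding in this scheme. It only chooses which of the already encoded layers to forward, which keeps its processing cost low.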

From the above discussion, we can conclude that a peer-to-peer connection is the shortest route for the data flow, but it creates $O(n^2)$ connections for $n$ active users and therefore does not scale well, so we must relay streams through a server. The MCU server is light on the network, as it creates $O(n)$ connections for $n$ active users. It is also suitable for clients with limited bandwidth, but it may add processing delay for a large number of participants. On the other hand, the SFU server does not require much processing, but it may load the network because it sends streams in the order of $O(n^2)$. Lastly, simulcast SFU (SSFU) can be considered the best compromise between bandwidth and resource utilization, because it does not require much processing and can adapt to network conditions. We will use SSFU to design our Zoom API.
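The comparison can be made concrete by counting streams for each topology. The sketch below follows the counts used in this lesson; `stream_counts` is a hypothetical helper, and it ignores simulcast layers and any streams the SFU chooses not to forward.

```python
from typing import Dict


def stream_counts(topology: str, n: int) -> Dict[str, int]:
    """Per-client and total one-way stream counts for n participants."""
    if topology == "mesh":
        up, down = n - 1, n - 1
        # Every stream has a client at both ends, so count each one at its sender only.
        total = n * up
    elif topology == "mcu":
        up, down = 1, 1
        # Every stream has the server at one end, so uploads and downloads are distinct.
        total = n * (up + down)
    elif topology == "sfu":
        up, down = 1, n - 1
        total = n * (up + down)
    else:
        raise ValueError(f"unknown topology: {topology}")
    return {"upload_per_client": up, "download_per_client": down, "total": total}


comparison = {name: stream_counts(name, 10) for name in ("mesh", "mcu", "sfu")}
```

For ten participants, the totals are 90 streams for the mesh, 20 for the MCU, and 100 for the SFU. The SFU total is close to the mesh total, but each client uploads only once, and the remaining load is carried by the server's network.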

Zoom uses differentiated services code point (DSCP) markings to prioritize its traffic at the network layer and maintain the quality of service. Refer to RFC 2474 to learn how DSCP works.
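As a rough sketch of how an application can request such marking, the code below converts a 6-bit DSCP value into the byte that the operating system expects and applies it to a socket. The value 46 (Expedited Forwarding) is only an example and is not taken from Zoom's configuration. Whether `socket.IP_TOS` is available and honored depends on the platform, and routers are free to ignore or rewrite the marking.

```python
import socket


def dscp_to_tos(dscp: int) -> int:
    """Place the 6-bit DSCP in the upper bits of the 8-bit DS (formerly TOS) byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0-63)")
    return dscp << 2


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Ask the OS to mark outgoing IPv4 packets of this socket (platform dependent)."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


# Example value only: 46 is the Expedited Forwarding code point,
# used here as an assumption for illustration.
EXAMPLE_DSCP = 46
example_tos = dscp_to_tos(EXAMPLE_DSCP)

# Usage sketch (not executed here):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   mark_socket(sock, EXAMPLE_DSCP)
```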

Geographically distributed media server

Companies such as Zoom, Google, and Microsoft have joint ventures with other companies and geographically distribute their media servers (MCU, SFU, simulcast SFU, and so on) to improve the performance of their services.

Geographically distributed media servers

The majority of users collaborate in their geographical area, and distributing media servers in their local area helps us to achieve reliable video transfer with low latency.
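One simple way to use distributed servers is to connect each client to the server with the lowest measured round-trip time. The sketch below assumes the measurements already exist; the region names and latency values are hypothetical, and a production system would likely also weigh server load and the locations of the other participants.

```python
from typing import Dict


def pick_media_server(rtt_ms: Dict[str, float]) -> str:
    """Return the name of the server with the lowest round-trip time."""
    if not rtt_ms:
        raise ValueError("no media servers to choose from")
    return min(rtt_ms, key=rtt_ms.get)


# Hypothetical measurements taken by one client, in milliseconds.
measured_rtt = {"us-east": 120.0, "eu-west": 35.0, "ap-south": 210.0}
chosen_server = pick_media_server(measured_rtt)
```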

Summary

The following table summarizes the different techniques and protocols discussed in this lesson:

| Name | Type | Description |
|------|------|-------------|
| SDP | Protocol | A standard of multimedia session information |
| SIP | Protocol | A set of rules to share, maintain, and terminate multimedia sessions |
| WebRTC | Protocol | The simple and easy peer-to-peer data exchange protocol |
| H323 | Protocol | The complex but fine-tuned peer-to-peer communication |
| MCU | Relay | Takes an input stream and sends an output stream. Requires computational power to merge incoming streams on the go |
| SFU | Relay | Takes an input stream and sends multiple output streams. Requires high bandwidth to send multiple high-resolution data streams |
| SSFU | Relay | Takes multiple input streams and sends multiple output streams. Can adapt to network conditions |

In this lesson, we discussed some techniques used for real-time multicasting. In the next lesson, we’ll make some key decisions that will help us to develop an efficient API design for a real-time video conferencing application like Zoom.
